Let's have plots appear inline:
In [1]:
%matplotlib inline
We're going to need os, numpy, matplotlib, skimage, torch, torch.nn, torch.nn.functional and torchvision.
In [2]:
import os
import numpy as np
from matplotlib import pyplot as plt
import skimage.util, skimage.transform
import torch, torch.nn as nn, torch.nn.functional as F
import torchvision
import imagenet_classes
In [3]:
# Run on the first CUDA GPU; use torch.device('cpu') if no GPU is available
torch_device = torch.device('cuda:0')
In [4]:
IMAGE_PATH = os.path.join('images', 'P1013781.JPG')
# Extract a 896 x 896 block surrounding the peacock
img = plt.imread(IMAGE_PATH)[1800:2696, 652:1548]
plt.imshow(img)
plt.show()
VGG networks consist of convolutional layers that use 3x3 convolutional kernels and pad the input so that the output size remains the same as the input size. A 3x3 convolution reduces each spatial dimension of the image by two pixels, so the input must be zero-padded by 1 pixel on each side.
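As a quick check (a sketch, not one of the notebook's cells), a 3x3 convolution with padding=1 leaves the spatial size unchanged:

import torch, torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
x = torch.zeros(1, 3, 224, 224)  # (sample, channel, height, width)
print(conv(x).shape)             # torch.Size([1, 64, 224, 224]); size preserved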
Convolutional networks in PyTorch expect images to come in the form of 4-dimensional arrays rather than 3-dimensional ones. The dimensions are (sample, channel, height, width). The sample dimension allows you to stack a number of images in a mini-batch so that you can predict (or train on) a number of images in one go. The channel dimension divides an image into one plane per channel, where the channels are R, G and B for the input, or more complex representations corresponding to filters further down the network. The height and width dimensions are the Y and X axes of the image.
For example: to store 128 RGB images of 224 x 224, you would need an array of size (128, 3, 224, 224).
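For instance (a minimal sketch, not one of the notebook's cells), allocating such a batch and inspecting its shape:

import numpy as np

batch = np.zeros((128, 3, 224, 224), dtype=np.float32)
print(batch.shape)     # (128, 3, 224, 224)
print(batch[0].shape)  # a single image: (3, 224, 224)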
The torchvision models expect images to be standardised: the mean RGB of the ImageNet dataset should be subtracted and the values scaled by the reciprocal of the standard deviation. These constants are used for all of the torchvision models:
mean=[0.485, 0.456, 0.406]
std=[0.229, 0.224, 0.225]
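If you prefer, the same standardisation can be expressed with torchvision.transforms.Normalize; this sketch (assuming a tensor image with values in [0, 1]) is equivalent to the manual arithmetic used in vgg_prepare_image below:

import torch
import torchvision.transforms as T

normalize = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
x = torch.rand(3, 224, 224)  # a random RGB image with values in [0, 1]
x_std = normalize(x)         # per-channel (x - mean) / std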
In [5]:
# Build it, requesting that the pre-trained model weights are loaded
# The call to the `to` method moves it onto the GPU
vgg16_net = torchvision.models.vgg.vgg16(pretrained=True).to(torch_device)
# Call the eval() method to switch to evaluation mode (disables dropout); we are not training
vgg16_net.eval()
# Also, store the ImageNet mean and std-dev for use later
MODEL_MEAN = np.array([0.485, 0.456, 0.406])
MODEL_STD = np.array([0.229, 0.224, 0.225])
There are a few transformations we must perform on the image to get it into a form that VGG-net can operate on; the function below:

- converts the image from uint8 type to np.float32
- scales and crops it to 224 x 224
- standardises it using the model mean and std-dev
- re-arranges its axes into (sample, channel, height, width) form
In [6]:
def vgg_prepare_image(im, image_mean, image_std, image_size=224):
    # If the image is greyscale, convert it to RGB
    if len(im.shape) == 2:
        im = im[:, :, np.newaxis]
        im = np.repeat(im, 3, axis=2)
    # Convert to float type
    im = skimage.util.img_as_float(im)
    # Scale the image so that its smallest dimension is the desired size
    h, w, _ = im.shape
    if h < w:
        im = skimage.transform.resize(im, (image_size, w * image_size // h), preserve_range=True)
    else:
        im = skimage.transform.resize(im, (h * image_size // w, image_size), preserve_range=True)
    # Crop the central `image_size` x `image_size` region of the image
    h, w, _ = im.shape
    im = im[h // 2 - image_size // 2:h // 2 + image_size // 2,
            w // 2 - image_size // 2:w // 2 + image_size // 2]
    rawim = im.copy()
    # Shuffle axes from (height, width, channel) to (channel, height, width)
    im = np.swapaxes(np.swapaxes(im, 1, 2), 0, 1)
    # Subtract the mean and divide by the std-dev
    # Note that we add two axes to the mean and std-dev for height and width
    # so that they broadcast with the image array
    im = (im - image_mean[:, None, None]) / image_std[:, None, None]
    # Add the sample axis: (channel, height, width) -> (sample, channel, height, width)
    im = im[None, ...]
    return rawim, im.astype(np.float32)
Transform the image:
In [7]:
raw_img, img_for_vgg = vgg_prepare_image(img, image_mean=MODEL_MEAN, image_std=MODEL_STD)
plt.imshow(raw_img)
plt.show()
Run the image through the network to generate a matrix of probabilities; each row of the matrix gives the class probabilities for the corresponding image in the batch.
We must first convert img_for_vgg (a NumPy array) to a PyTorch tensor. After that we apply the network by calling it as if it were a function, generating logits. We then apply the F.softmax function to get probability vectors. Finally we convert back to a NumPy array.
In [9]:
# We don't need gradients here as we are only performing inference/prediction
with torch.no_grad():
    t_im = torch.tensor(img_for_vgg, dtype=torch.float, device=torch_device)
    pred_logits = vgg16_net(t_im)
    pred_prob = F.softmax(pred_logits, dim=1)  # dim=1: normalize over the probability vector axis
    pred_prob = pred_prob.detach().cpu().numpy()  # detach from gradients, move to CPU, convert to NumPy
print(pred_prob.shape)
Use np.argmax to get the index of the class with the maximum probability:
In [10]:
pred_cls = np.argmax(pred_prob, axis=1)
print('Predicted class index {} with probability {:.2f}%, named "{}"'.format(
    pred_cls[0], pred_prob[0, pred_cls[0]] * 100.0, imagenet_classes.IMAGENET_CLASSES[pred_cls[0]]))
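As a follow-up sketch (not one of the notebook's cells), np.argsort can list the five most probable classes rather than just the best one:

top5 = np.argsort(pred_prob[0])[::-1][:5]  # class indices, most probable first
for cls in top5:
    print('{:>6.2f}%  "{}"'.format(pred_prob[0, cls] * 100.0,
                                   imagenet_classes.IMAGENET_CLASSES[cls]))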